Computing the probability of a tree using the Gillespie algorithm

We want to compute the probability of observing the site pattern A, C, A at the tips of the tree ((A:0.1, C:0.1):0.3, A:0.4);


In [1]:
import math
import numpy as np

# Useful function for drawing a DNA state, for instance at the root
def drawNewState():
    i = np.random.randint(0, 4)  # uniform draw over {0, 1, 2, 3}
    if i == 0:
        return 'A'
    elif i == 1:
        return 'C'
    elif i == 2:
        return 'G'
    else:
        return 'T'

# Simulate evolution along a branch with the Gillespie algorithm under the Jukes-Cantor (1969) model.
# Here we allow "mutations" towards the same state, so the total rate of events is 4 * 0.25 = 1.0.
def simulateAlongBranch(length, startingState):
    l = 0.0
    current = startingState
    while (l < length):
        l = l + np.random.exponential(scale=1.0) # 1.0 is the sum of the rates of all possible events
        if l < length:
            current = drawNewState()
    return current

# Compute the probability of a tree
def computeTreeProbabilityBySimulating(bl_a1, bl_c, bl_a2, bl_y):
    Nsuccess = 0
    Nreps = 100000
    for _ in range(Nreps):
        x = drawNewState()                 # state at the root
        y = simulateAlongBranch(bl_y, x)   # state at the internal node
        A1 = simulateAlongBranch(bl_a1, y)
        C = simulateAlongBranch(bl_c, y)
        A2 = simulateAlongBranch(bl_a2, x)
        if A1 == "A" and C == "C" and A2 == "A":
            Nsuccess += 1
    return Nsuccess / Nreps

if __name__ == '__main__':
    p = computeTreeProbabilityBySimulating(0.1, 0.1, 0.4, 0.3)
    print("Likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: "+ str(p))
    print("Log-likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: "+ str(math.log(p)))


Likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: 0.00416
Log-likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: -5.482240204708965
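As a sanity check (not part of the original notebook): under this process the number of events on a branch of length t is Poisson with mean t, and as soon as at least one event occurs the final state is uniform over the four nucleotides. The probability that the end state equals the start state is therefore e^{-t} + (1 - e^{-t})/4 = 1/4 + (3/4)e^{-t}. A quick sketch comparing the simulator to this closed form (self-contained re-implementation using NumPy's Generator API):

```python
import numpy as np

def draw_new_state(rng):
    return "ACGT"[rng.integers(0, 4)]  # uniform over the four nucleotides

def simulate_along_branch(length, starting_state, rng):
    # Same Gillespie scheme as above: exponential waiting times with total rate 1.0
    current = starting_state
    l = rng.exponential(scale=1.0)
    while l < length:
        current = draw_new_state(rng)
        l += rng.exponential(scale=1.0)
    return current

rng = np.random.default_rng(42)
t = 0.3
n = 100_000
same = sum(simulate_along_branch(t, "A", rng) == "A" for _ in range(n)) / n
exact = 0.25 + 0.75 * np.exp(-t)
print(same, exact)  # the two values should agree to about 3 decimal places
```

With 100,000 replicates the Monte Carlo standard error is about 0.001, so the simulated and analytical probabilities should match closely.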

Comparison to the likelihood computed with standard software

Standard software does not count substitutions to the same state, so branch lengths need to be rescaled before comparing. Branch lengths represent expected numbers of substitutions, and in our model only 3/4 of the events are real substitutions (an event lands on the same state with probability 1/4), so our branches contain 4/3 times more events than substitutions. To compare the likelihoods, we therefore have to run standard software on branch lengths that are 3/4 of those in our tree, i.e. ((A1:0.075, C:0.075):0.225, A2:0.3);. When I do that, I get loglk = -5.480185. So the Gillespie sampling estimate is quite close to the exact value computed with the Felsenstein pruning algorithm.
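The exact value can also be checked without external software (this check is an addition, not part of the original notebook). Under the self-substitution parameterisation used above, the Jukes-Cantor transition probability is P(j | i, t) = 1/4 + (δ_ij − 1/4)e^{−t}, which equals the usual JC69 formula after the 3/4 rescaling. Since the tree has only two unobserved nodes, we can sum over their states directly, a brute-force version of the pruning algorithm:

```python
import math

# JC69 transition probability with self-substitutions allowed (total event rate 1.0)
def p_transition(same, t):
    return 0.25 + 0.75 * math.exp(-t) if same else 0.25 - 0.25 * math.exp(-t)

states = "ACGT"
lik = 0.0
for x in states:          # state at the root
    for y in states:      # state at the internal node
        lik += (0.25                            # uniform prior on the root state
                * p_transition(x == y, 0.3)     # root -> internal node
                * p_transition(y == "A", 0.1)   # internal node -> tip A1
                * p_transition(y == "C", 0.1)   # internal node -> tip C
                * p_transition(x == "A", 0.4))  # root -> tip A2
print(lik, math.log(lik))  # log-likelihood ≈ -5.480185
```

This reproduces the -5.480185 obtained with standard software on the rescaled tree, confirming that the Gillespie estimate above is unbiased up to Monte Carlo error.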
